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There  are  a  few  algorithms  designed  to  solve  the  problem  of  the 

optimal  alignment  of  one  sequence,  the  pattern,  of  length  m,  with 

another,  longer  sequence  the  text,  of  length  n.  These  algorithms 

allow  mismatches,  deletions  and  insertions.  Algorithms  to  date 

run  in  0{mn)  time.  Let  us  define  an  integer,  k,  which  is 

the  maximal  number  of  differences  allowed. 

We  present  a  simple  algorithm  showing  that 

sequences  can  be  optimally  aligned  in  0(J(p-n)  time. 

For  long  sequences  the  gain  factor  over  the 

currently  used  algorithms  is  very  large. 

1.   Introduction 

The  problem  of  searching  for  homologies  between  sequences  has  been  repeatedly 
addressed  during  the  last  few  years.  With  the  explosive  growth  of  nucleotide  (and  amino 
add)  sequence  data,  there  has  been  an  urgent  need  for  efficient  algorithms  which  would 
compare  the  sequences.  Although  here  we  shall  focus  on  the  nucleotide  sequence 
comparisons,  it  is  evident  that  the  same  algorithms  can  be  used  to  compare  protein  sequences 
as  well. 

Our  algorithm  (1,2)  is  most  efficient  in  cases  where  we  want  to  compare  one  sequence, 
the  "pattern",  of  length  m  to  another  sequence,  the  "text"  of  length  n,  and  n  is  much  Izu-ger 
than  m.  Such  situations  arise  often  when  a  sequence  is  much  longer  than  the  pattern 
sequence,  or,  if  we  align  the  pattern  with  several  (or  many)  other  sequences,  whose 
combined  lengths  would  then  be  n.  This  happens  when  we  want  to  find  the  site  of  the 
optimal  alignment  between  a  section  from  one  sequence  and  another,  or  when  we  compare 
one  gene  with  several  related  genes,  or,  with  possible  candidates  for  pseudogenes.  We  may 
also  wish  to  compare  the  degree  of  potential  relatedness  between  introns,  or  repetitive 
sequences. 

A  simple  approach  to  finding  good  alignments  was  devised  by  Tinoco  et  al  (3)  and  by 
Maizel  and  Lenk  (4).  It  constructs  an  mXn  matrix  and  scores  matches  of  minimal  lengths. 
The  resulting  diagonals  are  then  visually  inspected.  For  homologous  sequences  an  elongated, 
nearly  consecutive  set  of  diagonals  is  observed.    Often,  however,  the  viewer  is  uncertain 


which  set  of  diagonals  to  choose.  Nepdleman  and  Wunsch  (5)  devised  an  elegant  algorithm 
for  finding  an  optimal  alignment  for  which  the  number  of  matches  as  penalized  by 
insertion/deletion  gaps  is  maximal.  Sankoff  (6),  Sellers  (7)  and  Dumas  and  Ninio  (8)  have 
further  extended  this  clgorithm  by  minimizing  the  measure  of  sequence  discrepancy  or  by 
maximizing  the  measure  of  similarity. 

Korn  et  al  (9)  have  written  a  program  for  finding  good  local  alignments.  Sellers  (10, 
11)  has  also  treated  tlie  local  alignment  problem  and  devised  an  algorithm  for  finding  within 
any  pair  of  sequences  all  locally-close  subsequences.  Goad  and  Kanehisa  (12)  have 
introduced  into  his  procedure  a  better  measure  of  the  quality  of  the  detected  alignments.  A 
weight  has  been  assigned  to  each  deletion  or  mismatch  type  and  a  search  for  all  subsequences 
in  which  the  density  of  the  differences  is  less  than  a  previously  set  value  is  carried  out.  The 
Goad-Kanehisa  program  is  probably  the  most  sensitive  method  to  locate  all  local  homology 
alignments  between  two  sequences.  A  much  faster,  approximate  procedure  has  been  devised 
by  Wilbur  and  Lipman  (13).  Nussinov  has  presented  an  efficient  code  searching  for  global 
DNA  homology  (14)  which  is  based  on  a  maximal  matching  algorithm  (15,  16).  Fickett  (17) 
has  greatly  improved  the  Seller's  algorithm  by  disregarding  computationally  irrelevant  steps. 

To  date,  the  algorithms  used  run  in  0{mn)  time.  We  define  an  additional  parameter,  k. 
k  is  the  largest  number  of  differences  (i.e.  mismatches,  insertions,  or  deletions)  allowed  in 
the  alignments  of  the  pattern  and  the  text.  The  algorithm  presented  here  uses  0(krn)  time. 
In  addition  to  the  large  economy  in  time,  the  memory  requirements  of  this  algorithm  are 
greatly  reduced  as  well.  Whereas  in  the  classical  method  (3-17)  the  memory  requirement  is 
0(mn),  here  the  required  memory  is  0(m"  +  it-). 

We  shall  next  describe  two  algorithms  for  finding  the  minimal  number  of  differences 
between  two  strings,  proceed  to  outline  the  new  algorithm  and  end  with  a  review  of  the 
biological  implications  of  such  algorithms. 

2.    An  Algorithm  for  Finding  The  Minimal  Number  of  Differences  Between  Two  Strings 

Suppose  we  have  two  strings  (i.e.  nucleotide  sequences):  R  =  r,r2  '  '  ■  r„  and 
B  =  bjb2  ■  ■  ■  b„.  In  our  case  each  string  is  composed  of  an  alphabet  of  four  letters, 
A,C,G,T.  On  each  of  the  strings  there  are  three  types  of  possible  editing  operations  (i.e. 
mutations): 

(i)  deleting  a  character  from  position  i  to  yield  R  -  rj-^  ■  ■  ■  r^./j+j^  '  '  '  ''m' 
(ii)  inserting  a  character  z  to  yield  R  =  r-j-^  ■  •  •  '■/^'■/  +  i  ''"''„; 

(iii)  changing  one  character  to  yield  R  =  t^t^  ■  ■  ■  Ti_^zti^^  '  '  ■  r„.  These  editing  operations 
have  been  penalized.    The  problem  is,  then,  what  is  the  minimal  cost  of  editing  operations 


that  will  transform  sequence  R  into  sequence  B.   This  cost  is  denoted  D{R,B). 

To  simplify  the  presentation  each  difference  between  the  two  sequences  is  assigned  a 
penalty  of  one  unit,  i.e.  Cch  =  C^e  =  C/tv  =  1,  where  Cch^  Cqe  and  Cn,-  are  the  costs  of 
changing,  deleting  or  inserting  nucleotides.  Now  the  cost  of  transforming  sequence  R  into 
sequence  B  is  simply  the  count  of  the  minimal  number  of  differences  involved. 

In  matrix  D  sequence  R  is  stretched  along  the  3'  axis  (i  =  1,  .  .  .  ,m)  and  sequence  B 

0  =  1 n)   is  along  the  j:-axis.   The  D,j   entry  contains  the  attempted  matching  of 

fj^  •  •  ■  r,  and  b^  ■  •  ■  b,.  Say,  we  want  to  further  inspect  entries  in  the  matrix.  We  can 
continue  either  by  walking  along  a  row  (/  -  i  +  1),  by  walking  along  a  column  0'  -;'  +  1), 
or  by  increasing  both  parameters  at  the  same  time  (i  -  i  +  IJ  ->  j  +  1)  and  sliding  along  a 
diagonal  d. 

Each  of  the  above  steps  has  biological  implications.  Progressing  along  a  row  implies 
that  the  text  (sequence  B)  has  some  nucleotides  which  the  pattern  (sequence  R)  lacks.  That 
is,  sequence  B  contains  an  insertion.  Walking  along  a  column  in  the  D  matrix  implies  that  B 
lacks  some  nucleotides  which  are  present  in  R,  i.e.,  there  is  a  deletion  in  the  text. 
Continuation  along  a  diagonal  involves  considering  longer  sequences  with  either  a  match  or  a 
mismatch. 

It  is  clear  that  advancing  along  a  diagonal  consisting  of  elements  Dij  with  j  -  i  =  d  is 
preferable  to  advancing  along  a  row  or  a  column.  Consider  nucleotides  r,-  and  bj,  with 
r,  ^bj.  Continuing  along  only  one  sequence,  say  B,  we  may,  at  best,  match  the  previously 
unmatched  last  nucleotide  of  B.  That  is,  r,j.|  =  bj  and  r^  is  looped  out  of  the  alignment. 
Suppose  we  continue  further  along  the  column  and  consider  now  the  pair  r,+2)  ^j-  ^^r  the 
purpose  of  obtaining  an  optimal  aligrmient  of  R  and  B  it  is  immaterial  whether  /-,+•  and  bj 
match  or  not.  K  r,+2  ^  bj,  then  the  previous  alignment  is  kept  and  r^+j  dangles  at  the  end. 
If  r,a.2  =  bj,  then  we  may  keep  this  match,  but  now  in  addition  to  r  r^.^  is  looped  out  as 
well.  This  state  of  affairs  is  bound  to  continue.  Either  the  loop  grows  and/or  the  unmatched 
dangling  end  grows.  The  same  situation  holds  for  going  along  a  column,  i.e.  keeping  r, 
constant  and  increasing  b.  The  entry  Djj  in  the  matrix  is  then  the  minimal  cost  of  the  editing 
operations  required  to  transform  r^  •  •  •  r,  to  6^  •  •  •  iy  (see  Fig.  1). 

The  following  algorithm  computes  the  matrix  Oro,...^;o..,,/ii 

Initialization      Dq  0  '•-  0- 

for  all  i,  !■<.  j  ■<.  n  ,  Dqj  :=  ;'  ,. 
for  all  i,  I  ^  i  ■^  m  ,  D,  q  :=  i  ,. 

for  i:—  1  to  m  do 

for  j:=  1  to  n  do 

Dij:=    min    {D^.^j^  C^e,    D,j_^+  C[^,    Di_,j_y    if    r,  =  bj    or   D,_^j.^+  Cch 
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otherwise) 

(£>, ,  is  the  minimum  of  three  numbers.    These  numbers  are  obtained  from  the 

predecessors  of  D,^  on  its  column,  row  and  diagonal,  respectively). 

The  computation  will  continue  until  D„^  has  been  reached.  Filling  the  matrix  would 
thus  require  Oipin)  time.  In  order  to  get  the  actual  alignment,  a  backtracking  procedure  is 
carried  out  by  following  a  minimizing  path  from  D„^  to  Dqq.  Variants  of  this  basic  method 
have  been  used  by  Needleman  and  Wunsch,  Sankoff,  Sellers,  Goad  and  Kanehisa,  Nussinov, 
Fickett  and  Wagner  and  Fisher  (19). 

3.    An  Improved  Algorithm  for  Finding  The  Minimal  Number  of  Differences  Between  Two 
Strings 

Suppose  we  have  two  strings  with  k  being  the  largest  nimiber  of  differences  (insertions, 
deletions  and  changes)  allowed  between  them.  Obviously,  this  algorithm  handles  only  cases 
where  n-m  <  k.  Namely,  R  =  r^-  •  •  r„  and  B  =  by-  ■  ■  b,„+i^.  In  the  previous  section  we 
showed  an  algorithm  which  finds  their  optimal  alignment  in  O(m-)  time.  Here  we  give  the 
improved  cilgorithm  which  reaches  the  same  result  in  0{mk)  time  (18). 

As  we  have  noted  in  the  previous  section,  computation  of  elements  in  the  matrix  which 
are  far  from  the  central  diagonals  are  not  going  to  figure  in  the  final  optimal  alignment  of  the 
sequences.  This  suggests  that  a  more  efficient  algorithm  can  be  derived  by  limiting  the 
computation  and  considering  only  the  central  diagonals. 

For  consecutive  elements  along  a  diagonal,  the  difference  D^j  -  Di^^j^y  is  either  zero 
or  one.  This  allows  efficient  storage  of  the  information  in  the  Dij  matrix.  For  each  diagonal 
d  of  the  matrix,  it  suffices  to  store  the  information  which  specifies  the  point  (z,  j  =  i  +  d)  on 
d  where  the  D,j  values  increase. 

Let  k  be  the  maximal  number  of  differences  allowed  between  R  and  B .  For  a  number 
of  differences  e  (e  <  ik)  and  a  diagonal  d,  let  L^ ,  denote  the  largest  row  /  such  that 
Di[j.^  =  e  This  definition  implies  that  between  the  subpattern  r^-  •  •  r^  and  the  subtext 
^i  '  '  '  ^L  ^rf>  there  are  e  differences.  (Note:  L^_,  +  i  is  the  corresponding  j.  L^  ^  is  the 
maximal  i  that  satisfies  our  requirements.  The  maximal  i  +  d  yields  the  j  parameter,  since 
j  -  i  =  d).    This  definition  also  implies  that  r^    +y  ^  b^    +^+1.  Otherwise  L^^,  +  1  rather 

than  L^  ,  would  have  been  the  maximal  row  i  with  the  same  number  of  e  differences. 

If  some  Lj ,  with  e  -^  k  equals  m,  we  can  reach  the  end  of  the  pattern  (row  number  m 
in  the  matrix)  without  inoirring  more  than  k  differences.  In  this  case,  and  only  in  this  case, 
the  answer  to  the  originjil  question  "does  R  occur  in  B-  sttu-ting  at  fe^  -  with  at  most  k 
differences"  is  "yes".   This  knowledge  of  the  L^  ^  suffices  for  our  purpose. 
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A  basic  feature  of  the  D  matrix  is  that  all  elements  D,,+^  (^(,/-d)  on  the  upper 
(lower)  d'th  diagonal  necessarily  exceed,  or  are  equal  Xo,.d.  Now,  e  in  Z,^,  is  just  a 
particular  entry  along  this  diagonal.  Hence,  by  definition,  \d\  ^  e.  We  also  restrict  our 
attention  to  e  <  it,  i.e.  at  most  it  differences,  and  conclude  therefore  that  \d\  ^  k.  We  can 
thus  focus   on  the  central  2Jt  +  1   diagonals:  the  main  diagonal,   i  =  j    (cf  =  0),   the  it 

diagonals  above  it,  J  =  1 k  and  the  ;t  diagonals  below  it,  d  =  -1 -it.   The  time 

requirement  for  the  computation  is,  then,  0(km),  rather  than  0(m")  as  in  the  case  for  the 
standard  algorithm. 

How  do  we  compute  L^ ,?  Using  the  values  of  Z,^,.-!,  Z.rf_i,,_i  andL^^.^,.!  a  variable 
row  is  initialized  and  increased  by  one  unit  at  a  time  till  it  hits  the  correct  value  of  L^^. 

The  0{km)  algorithm  is  as  follows: 

1.  ford:=-(k+I)to{k+l)do 

if  d  <  0  -         ■ 

then   I,rf_[^|_i:=  |d|  -1  ; 

2.  for  e:  =  0  to  fc  do 

for  d:  =  -  e  to  e   do 


row  =  max 


Ld,-i  +  1 

^d-i,e-l 
Ld^l,e-i  +    1 

4.  while  r,^+i  =  Z>,^+ixd  do 

row:=  row  +  1 

5.  Lrf ,  =  row 

6.  If  Z.^,  =  m 

then  print  'yes*  and  Stop. 

1.  The  values  assigned  in  the  first  section  of  the  initialization  procedure  have  been 
selected  in  a  way  such  that  they  will  not  interfere  with  the  calculation  of  L^  ,'s. 

2.  The  limits  of  the  loops  have  already  been  explained:  e  ^  k  and  \d\  ^  e. 

3.  row  is  initialized  by  using  three  diagonals:  d,  d  -  1  and  d  +  1.  On  each  of  these 
the  largest  row  containing  the  e  —  1  value  is  taken.  We  do  not  add  one  unit  to  the  L^_|  j_i 
term  since  we  are  already  on  a  lower  diagonal  and  thus  we  move  along  the  same  row. 

4.  We  continue  on  the  same  diagonal  d  as  long  as  the  nucleotide  in  the  R  sequence  is 
the  same  as  its  counterpart  on  the  B  sequence  and  thus  the  number  of  differences  e  does  not 
increase.  When  these  two  nucleotides  differ,  we  set  step  5.  In  step  6,  the  end  of  sequence  R 
has  been  reached.  Since  e  -^  k,  an  appropriate  match  has  been  found. 

An  example  is  given  in  Figure  2.  Note,  however,  that  it  is  not  stored  in  the  computer  as 
such.  A  matrix  is  produced  here  only  for  the  convenience  of  the  reader.  It  suffices  to  store 


the  necessary  L"s.  This  comment  holds  true  for  Figures  3a,b,c,d  as  well. 

4.  The  Efficient  String  Matcblog  Alcrritfem  With  it  Dii^erences 

We  next  describe  the  new  algorithm.  First  we  build  a  table  by  analyzing  the  pattern. 
Then  we  examine  the  text  from  left  to  right  checking  possible  occurrences  with  respect  to  one 
starting  location  (in  the  text)  at  each  iteration.  Besides  the  tables  built  in  the  pattern  analysis, 
the  input  to  each  iteration  consists  of  the  knowledge  acquired  in  previous  iterations.  The 
rightmost  location  in  the  text  to  which  we  have  arrived  in  previous  iterations  is  of  particular 
significance.  Each  iteration  consists  of  manipulating  this  knowledge.  U  necessary,  we  proceed 
to  investigate  the  text  to  the  right  of  this  rightmost  location.  (As  examples  use  figures  2,  3a, 
b,  c,  d). 

The  first  part  of  the  algorithm  is  tlie  pattern  analysis.  The  output  of  the  pattern  analysis 
is  a  two  dimensional  array  MAXLENGTH[0,...,m-lfi,...,m-l].  MAXLENGTH{iJ)  =  f 
means  that  r,+i,  .  .  .  ,r^^y  =  rj+i,  .  .  .  .rj^f,  and  r,+y+i  #  ''j+f+i-  This  means  that  if  we  lay  a 
section  of  the  pattern  starting  at  r,+i  over  another  section,  starting  at  Tj^^,  f  is  the  largest 
exact  repeat  of  the  two  sections.  This  exact  repeat  will  end  at  r,+^  in  the  first  section  and  at 
r,4.y  in  the  second.  The  pattern  analysis  takes  O(m')  time.  It  is  exceedingly  simple.  Fig  4 
presents  an  actual  example  and  its  caption  gives  a  brief  description  of  the  computation. 

Let  us  now  return  to  the  text  analysis  which  consists  of  n  — m  +  it+1  iterations  (i.e  from 

0  to  n-m  +  yk).    At  iteration  i  we  check  for  an  occurrence  with  ^k  differences  of  the  pattern 

starting  at  fc;+w  Let  bj  be  the  rightmost  nucleotide  in  the  text  that  was  reached  first  at  some 

prior   iteration,   say   the  /  th  (0  :s  /  <  /).     Since  we  look  for  matches  with    at  most  k 

differences  between  the  pattern  and  some  section  of  the  text,  there  are  necessarily  ^  k 

differences  between  b,^i,  .  .  .  .b,  and  the  portion  of  the  pattern  matched  with  it. 

One  example  is  repeatedly  given  in  the  figures  (2,  3a,  b,  c,  d).    Let  us  here  give  an 

additional    example.    Let    r.,  .  .  .  .r-j    be    ACTACTTTCCGAG    and    let    b^-,  .  .  .  ,b-^Q    be 

AGCTACTTGTCCAG  (with  iteration  /  =  16  , ;  =  30).  The  correspondence  is: 

1  2    3    4       5    6    7  8    9   10  11   12  13 

R|A  CTA      CTT  TCCGAG 

I  I    I    I      I    I    I        Ml        II 

B|A      GCTA      CTTGTCC         AG 

17     18  19  20  21     22  23  24  25  26  27  28        29  30 
bi+i  i,+i  bj 

The  correspondence  gives  k  =  3  differences.  Hence  there  are  also  ^k  differences  between 
some  portion  of  the  pattern  and  bj^.^,  .  .  .  ,bj  (i  -  20).  Let  us  call  this,  r^,  .  .  .  .r^j,  portion 

the  subpattern.    Thus,  for  some  correspondence  between  the  nucleotides  of  fe,^, b/  (i. 

e.    6-1'  •  ■  •  -^30)    ^^^    ^^    subpattern,    there    are    at    least  j-i-k    (30-20-3)    matched 
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nucleotides.  It  is  easy  to  see  that  all  the  nucleotides  of  the  text  that  have  successive  matches 
with  the  subpattern  form  at  most  k+1  successive  substrings.  For  each  substring  of  the  text  we 
know  its  corresponding  substring  in  the  pattern.  Wlien  a  substring  of  the  text  b^^^.  .  .  .  .bp+^ 

matches    a    substring    of    the    pattern    r^^^ r^+^    (e.g.    p  =  18  ,/  =  6  ,  and  c  =  1    , 

ri+i ri+6  =  b^s+i-  •  •  •  •^is+e)  and  6^+/+i  ^  /-c+z+i.  denote  this  by  the  triple  {p,cj) 

(e.g.  (18,1,6)).  In  the  section  ft,+]^,  .  .  .  ,bj  there  are  at  most  k  unmatched  and/or 
mismatched  nucleotides.  Each  of  these  b^  nucleotides  is  denoted  by  the  triple  (A ,0,0)  (in  our 
example  h  =  17  ,  b^  +  i  =  i^g  =  G). 

Thus,  the  substring  of  interest  in  the  text,  b,+j^,..,bj,  (^21.  •  •  •  .^30)  breaks  up  into  0{k) 
consecutive  matched  or  unmatched  elements  as  follows:  (20, 3, 4), (24, 0,0), (25,7, 3), (28, 11, 2). 
These  triples  are  denoted  as  Sij  (52o,3o)' 

In  order  to  find  the  best  alignment  during  iteration  i  we  always  use  the  algorithm 
described  in  section  3  to  guide  us.  In  addition,  however  we  use  the  output  of  MAXLENGTH 
(pattern  analysis)  and  Sij,  the  output  of  a  previous  iteration. 

In  iteration  i  the  pattern  r^,  .  .  .  ,r„  is  put  on  fc,+;,  .....  Returning  to  our  example, 
for  i  =  20  we  have: 

1       2    3    4     5     6    7    8    9    10  11   12  13 
R|  ACTACTTTCCGAG 

B|AGCT     A      CTTGTCCAGAAG... 

21     22  23  24  25  26  27  28  29  30  31  32  33  ... 

Iteration  1  consists  of  the  0(km)  algorithm  of  section  3.  However,  it  uses  a  modification  of 
the  instruction  4  presented  there.  Instruction  4  increases  the  variable  row  by  one  unit  at  a 
time.  The  availability  of  the  sequence  of  triples  5,^  and  MAXLENGTH  allows  increasing  row 
by  much  larger  jumps  as  long  as  we  do  not  require  information  about  symbols  of  the  text, 
which  are  beyond  bj.  Once  row  takes  us  beyond  bj,  5,^  and  MAXLENGTH  do  not  help  us  any 
more  and  we  apply  (the  old)  instruction  4  as  in  the  computation  of  the  previous  algorithm. 

Let  us  now  explain  instruction  4. new.  The  while  loop  of  the  new  instruction  4  looks  for  the 
longest  exact  match  w  between  any  section  of  the  text  bt+raw+d+i  and  some  portion  of  the 
pattern  r^^+^  (as  long  as  j  +  1  <  i  +  row  +  d+1  ^j).  According  to  5,^  (i.e.  the  listing  of 
triples)  fe,4.rOTv+d+i'  •  •  •  •^i+raw-i-d->-f  matches  r^^^,  .  .  .  ,r^+^  for  some  index  c  of  the  pattern. 
In  the  computation  of  w  we  have  the  following  cases: 
Case(a).  /  ^  1.    In  our  example  (i  =20),  assume  that  row  =  8  and  d  =-2  and  we  want  to 

find  what  is  the  longest  exact  match,  w,  between  627.  •  •  •  .  and  r^ Inspecung  ^20,30 

we  find  the  triple  (25,7,3),  meaning  that  nucleotides  27  to  28  of  the  text  match  nucleotides  9 
to  10  of  the  pattern  (/"  =  2  and  c  =  8).    Thus,  this  triple  covers  nucleotide  27.    At  this  stage 


we  refer  to  MAXLENGTH.    MAXLENGTH {c ,row)  gives  the  maximal  number  g  such  that 

a,-+i a,+y       equals       a,,^+i a,^+^.        In       our       example       r^^^.  .  .  .  .rg+j 

=  rg+i rg+y  g  equals  then  to  5.   Case  (a)  has  two  subcases: 

Case(al).     f  *  g.     It     is     easy     to     see     that     here     fe,+r<w+d+i.  •  •  •  '^f+row^d+m/rf/^i 

=  r,^+i ,r,^+,„,„^,^,   and  (fe,.,^*<i+„/„(/,^)+i  ^  ^aw+m/^(^^)  +  l)•   The  longest  match  w 

here  is  min(/,g).  Therefore,  we  assign  w:=  min(f,g),   and  row  increases  by  w.   In  our 
example  w  =  2,  (mjn(2,5)),  and  row  =  row  +  2  =  10. 

Case(a2).  f=g.  This  implies  &,^,^+rf+i !>,+,<„^+d^/  =  r,^+i r,^*/  but  does  not 

reveal  whether  the  next  nucleotide  in  the  text,  namely  b,^^^+^+i  is  equal  to  the  next 

nucleotide  in  the  pattern  r^^^+^+i-    Therefore,  we  assign  row:=  row+f  and  apply  again  the 

present  case  analysis  accumulating  the  "jump"  over/  into  w. 

Case  (b).  f  =  0,  namely  bi+,„^+^+,  is  unmatched  in  the  S,j  triples.  In  this  case  we  compare 

nucleotide  ^,+^0^^^+!  to  its  corresponding  one  in  the  pattern,  r^^^^.  For  example,  compare 

b-^  to  Tj,  (/  =  20  ,  row  =  3  ,  d  -  i).  This  case  has  two  subcases. 

Case(bl).  The  two  nucleotides  differ.  Ir.  this  case  w.=  0. 

Case{b2).  The  two  nucleotides  are  identical.  In  this  case  w  =  1  and  we  assign  row:=  row  +  1, 

and  apply  again  the  present  case  analysis  accumulating  this  propagation  of  1  into  w.   Detailed 

examples  are  given  in  figures  2,3a,b,c,d. 

The  text  analysis  algorithm  is 
;:=0; 

for  i:  =  0  to  n-m  +  k  do 
begin 

[1]   Proper  initialization  (as  in  instruction  1  of  the  0{km)  algorithm) 
[2]  for  e:^0  to  k  do 
for  d:='-e  to  e  do 
begin 

[3]   row  :=  maj:[(Z.rf  ,_,  +  l),(Z,d_i  ,,_n_(Z,rf+;,,_i+l)]. 
[4. new]    while    i-\-row  +  d-'r\  ■^  j     do 

[4.new.l]    take  from  5,^  the  triple  that  "covers"  ^,+^3^+^+1.  Derive  from 
this     triple    the    indices    cj    such    that    ij+^w^d+i ^i+ro^+d+f 

~   ''c+i ''c^f 

[4. new. 2]   if  /  >  1 
then  (*  case  a  *) 

[4.new.3]   if  /  #  MAXLENCTH{c,row) 
then  (•  case  al  *) 

row  :—  row  +  min(J , MAXLENGTH {c , row)) 
go  to  5 
else  (•  case  a2  *) 

row:=  row  +  /  ; 
else  (*  case  b  ') 

[4. new. 4]   if  b,^,^^^^^  *  r,^+i 
then  (•  case  bl  «) 
go  to  5 
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else  (•  case  b2  *) 
row:=  rc»v+l 
od 
[4. old]   while  r^^^^  =  b.^.^+^+j   do 

row:=row  +  l. 
[5]   Lay.=  row. 
[6]   ifirf,,  =  m 

then      print  'YES*  and  go  to  7 
end 
[7]   K  new  symbols  of  the  text  were  reached  0'  was  increased),  then  starting  from  the 
L^jt  (which  implies  the  new  j,  i.e.  j  =  L^^  +  d  +  i)  we  reconstruct  the  new  S,j. 
end 

Impiementation  remarks. 

Instruction  4.new.l:  When  we  compute  L^q  we  start  searching  for  the  indices  (f,c)  at  the  first 
triple  of  S,j.  We  know  which  triple  was  checked  when  any  Z,^ ,  gets  its  value.  So,  when 
computing  a  new  Z.^,,  we  already  know  which  were  the. last  triples  checked  in  the 
computation  of  each  one  of  Z-^_2_,_i,Z-(i_,_:  andLj+i^^,_|c  In  instruction  3  row  got  its  initial 
value  from  the  maximum  of  ^d-:,*-:'-^^,^-:''"  1  ^'^  ^d^i.e-:'^  1-  "^^  ^^^^  triple  checked  in  the 
computation  of  the  maximum  is  the  first  to  be  considered  now,  in  the  computation  of  Lj ,. 
Instruction  7:  If  at  least  one  new  symbol  of  the  text  is  reached,  at  the  end  of  iteration  /  a  new 
sequence  of  triples  is  created  instead  of  5. ,.  If  b,  is  the  rightmost  symbol  of  the  text  reached 
in  such  an  iteration  (j),  then  denote  the  new  sequence  5,j.  For  each  Z.j_,  computed  during 
iteration  i.  we  keep  a  sequence  (of  triples).  This  sequence  "realizes"  L^  ^.  That  is,  it  gives  a 
correspondence  between  r^,  .  .  .  ,r^      and  b,+^,  .  .  .  .fc/^^    -^,  with  exactly  e  differences. 

This  sequence  is  used  in  the  computation  of  5,  _ 

At  the  beginning  of  each  iteration  /,  each  L^ ,  has  an  empty  triple  sequence.  Here,  we  use 
again  the  fact  that  initially  (at  instruction  3)  row  is  the  maximum  among  L^_;^^_^,  Z,;j,_,  +  1, 
and  Z,j^_;,_;+1  and  finally  (at  instruction  5)  is  L^^.  Assume  that  we  know  the  sequences  of 
the  predecessors  of  L^  ,  (namely,  the  sequences  of  Lj_^  ^_^,  Z,^  ,_^  and  L^^-^^^).  We  get  the 
sequence  of  L^^  by  adding  triples  to  the  end  of  the  sequence  of  the  predecessor  which  gives 
the  maximum  in  initializing  row.  Let  /;  be  the  initial  value  of  row.  If  /•  got  its  value  from 
^d-ie-i  (^^  ^dt-d  ^^"  ws  ^^^  ^°  i^s  sequence  the  triple  {i  +  l,  +  d-l,  0,  0).  (Meaning  that 
for  bi^.,  ^j,  there  is  no  corresponding  symbol  in  the  pattern).    Following  instruction  5,  if 

^d.e  >  'i'  w^  "^^^  ^^^  ^s  triple  (i  +  l.  +  dJ,L^_^-l,J  to  the  sequence  of  L^ ,.    This  last  triple 
describes  the  match  between  substrings  of  the  pattern  and  the  text.  It  was  found  during  the 
computation    of  L^ ,    given   Lj_;,_^,   id_,_i   and  Z,^  +  ;,_|.     This   latter   addition   is   done 
regardless  of  whether  the  source  of  l^  was  Z,^_i_,_i,  L^  ,,-i  or  L^+i^^^i- 
At  the  end  of  iteration  i  we  check  which  of  the  sequences  of  the  2A^+1  Z-^^'s  reach  the 
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rightmost  symbol  of  the  text.  If  the  index  of  this  symbol  is  greater  than  ;  (L^^^  +  d+i  >  j), 
we  take  its  sequence  to  be  the  new  Sij. 

The  old  instruction  4  (where  to\v  is  increased  by  one  unit  at  a  time  without  using  Sij 
and  MAXLENGTH)  is  employed  each  time  we  move  to  a  new  symbol  of  the  text.  We 
maintain  0{k)  diagonals  at  any  time  during  the  text  analysis  and  may  need  to  compare  the 
new  symbol  for  each  of  them.  Hence,  the  old  instruction  4  requires  a  total  of  0{kn)  time 
throughout  the  text  cmalysis.  In  order  to  evaluate  the  number  of  steps  which  are  required  by 
the  new  instruction  4  at  iteration  i,  we  use  again  the  fact  that  0{k)  diagonals  are  computed. 
The  sequence  5, ,  has  at  most  2it+l  triples.  We  can  charge  each  operation  performed  on  any 
one  of  the  diagonals  to  either  a  difference  being  discovered  (there  are  ^  ;k  such  differences), 
or  to  a  triple  of  Sij  being  examined  (there,  are  <  2k+\  triples).  This  amounts  to  0{k) 
operations  per  diagonal  at  each  iteration  i.  ITierefore,  the  total  rimning  time  of  the  text 
analysis  is  Oik'-n).  This  implies  that  the  program  will  run  the  fastest  when  only  good 
alignments  are  accepted  (i.e.  k  is  stipulated  to  be  small).  The  actual  alignment  of  the  pattern 
and  text  can  easily  be  produced  if  we  retciin  all  those  5/^  where  the  end  of  the  pattern  (m)  has 
been  reached.  Along  with  each  5,^,  its  corresponding  number  of  differences  (^  k)  can  be 
stored. 

A  preliminary  version  of  the  program  has  been  written  in  Fortran  and  implemented  on 
the  CDC  6600.  This  program  has  been  used  to  calculate  the  figures  given  here. 

5.  Discussion 

This  paper  constitutes  an  additionjil  example  of  a  successful  interdisciplinary  effort 
where  an  algorithm  developed  by  computer  science  designers  is  applied  to  solve  problems 
that  arise  in  molecular  biology.  The  principle  of  the  algorithm  we  have  described,  though 
seemingly  complex,  is  essentially  simple. 

As  we  slide  the  pattern  along  the  text,  we  shall  then  get  several  alignments,  each  with  at 
most  k  differences,  rather  than  only  the  position  in  the  text  of  the  optimal  alignment.  Clearly 
this  is  advantageous  both  in  comparing  amino  acid  and  nucleotide  sequences.  One  may  then 
list  all  alignments  either  by  site  or  by  their  minimal  differences  score,  i.e.  by  how  good  they 
are. 

For  the  problem  of  efficient  string  matching  in  the  presence  insertions  and  deletions,  the 
algorithms  presented  here  are  the  most  efficient  to  date.  Especially  noteworthy  is  the 
algorithm  described  in  section  4.  Whenever  the  text  analysis  (rather  than  the  pattern) 
dominates  the  computation  time,  the  running  time  of  this  algorithm  is  close  to  optimal. 

Very  often  long  sequences  are  involved  in  homology  searches.  Also  very  frequent  is  the 
need  to  compare  one  sequence  (i.e.  the  pattern)  to  many  others. 
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Thus,  the  fast  growing  rate  in  tiie  sequencing  of  DNA  (and  protein)  molecules,  coupled 
with  the  efficiency  of  the  algorithm,  as  well  as  the  listing  it  provides  of  all  alignments  better 
than  a  previously  set  value,  make  it  an  attractive  tool  in  molecular  biology. 
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Fig.  1.  The  D,^  matrix  computed  for  the  strings  R  (of  length  m  =  12)  and  B  (of  length 
n  =  15).  The  match  ending  at  (12,12)  =  3  on  the  m'th  row  thus  yields  the  best 
alignment,  eqn  fatal  error:  converted  token  0  +  0+0+0  +  373...  too  long  file  -,  between 
lines  2361  and  2670 
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Fig,    2.     Alignment    of   the    5    and    R    strings   allowing    at    most    ^; 
differences.  The  L^g's  of  this  computation  are 


here 


ijn    _2    —  ^ 


Z.'Q_i    —    — _    .    Inn    —    2    ,    Lni    —    3   ,    Irnp   —    6   ,    Ln_^    — 


ii.-,  =  -«  ,  I,.c  =  -1  ,  I:.,  =  0  .  I,.,;  =  6  ,  Li 

3  =  7 

Lzx  =  -°°  ,  I2.1  =  -■-  ■  ^2.2  =  5  .  ia,3  =  6 

L2  [    ~    ~°°,Zj32""    ~-    ',^33—5 

/._;  _,  =  -«     Z._i  c  =  C     Z._i,i  =  3  .  I_,.2  =  5 

^-..3 

I_2C=    —       i-2.1    =    -       i-2.2   =    4       1-2,3   =   6 

Z.-2  1   =   — °°       •^-2,2   =   2      -^-3,3   =   6 

The  answer  to  the  question  "can  we  find  an  occurrence  of  string  R  with  at  mcs; 


k  differences  starting  at  6  ,  "  is  ^TS  L 


=  12  The  rightmcs-  ;ymb; 
5i,j  IS  f0,0,2),;2,0,0),;3,2.3),f6,0,G).;'7.7,5)  This  latter  ccmDut::.t:cn  is  explained 
m  section  4  and  is  put  in  now  so  that  this  figure  can  alsc  oe  used  as  the  first 
stage  in  the  text  analysis  algorithm  Now,  we  m.ove  the  pac:ern  to  the  righc  an 
lav  it  on  the  second  svmbol  of  the  text    See  Fig    3a 
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Fig.  3a.  This  figure  explains  the  0[k'^n)  text  analysis  algcrithm 
tion  of  Figure  2.  Thus  here  we  put  the  pattern  [R]  on  the  text 
62   The  resulting  calculation  is 


!t  !S  a  cont 
\B)  at  po5 


1  n  u  a  - 
it!on 


■^C.-2  -   ~ 
il.-l   =   - 

Lz.z  =  — 
•^■2.1  -  "" 


^c.-i  =  - 


^2.1   = 


—    —  00 


^  -1,C 


; 


■2.1  - 


,  Z-c.c  -  0  ,  Ic.i  - 
,  Li.i  =  0  ,  I1.2  =  1 

Z.22  =   0  ,   Z,2.3  ~   3 
i-Z.2  -   0 

=  0,  I-i.i  =  2  ,  i-1,2  =  3 

L_2  2   =    3   ,    1-2.3   —   5 


,   Z-C.2  -  5 
■"1.3   ~   ^ 


i, 


T     =     C 


■-    1,-1 

•^^-2.0   = 

L_3  1   =   — °o   .   L -iz   ~   2  ,    >- -2.2  ""   "^ 

How  we  conr.pute  i;c.2)    ^^xl  =  ma.x[L. 


1,1 


I 


c.r 


=  6 


^11  + 


■  ^ 


roxL    - 


-  o 


■  1  e   co: 


/    =  3 


now   b^,         to  rg  ,  From   5;  ,2     we   get   that   b^b^b^   =  t ^r ^r .^ 

g  =  M.AXLENCTH  [2,2)  =  10   Therefore,  Ic.2  =  ^^il  +  min'J  .g)  =  2  +  3  =5 

How  w^e  compute   L,o    rort/  =  maxyLz^-  ^1.2  "•"  ^'  ^2.2  '•'-)■  ^^"^  -  5    Vi'e  com 


now  b; 


to  r. 


From 


1.1 


2  we  get  that   bg. 


12 


=  r. 


12 


/     =    S 


5  =  MAXLENGTH [7. o)-Q       Therefore,      Lc.2  =  roxL +min  ;/ ,5  )  =  5  +  0  =  5 
answer  to  the  question  posed  in  the  caption  of  Fig    2  is  NO    None  of  tne  L 
equal  to  12  [m).    The  rightmost  symbol  of  the  text  which  was  checked  is  67- 
we  do  not  create  a  new  5i  j  sequence. 
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fig.  3b.    The  pattern  is  now  laid  starting  at  63    (See  legend  of  Fig  3a)    The 
lation  of  the  L's  yields; 


calcu- 


-1,-1    -    ""-    ■   -  l.C   -    ~-    •    -  1.1 


Lq|    —    —  °°,    ij32~-    —  -    ,    Ltt"    ^ 


^-L-l    =    — 
^-2,C=    — 


Z_,.C   =    0,    I. 


l.I 


■1.2 


=  5  ,  i. 


=  6 


Z-. 


i —  21    ~    -    ■    -^—22   ~    ^'    -^—23"    ^ 


The  answer  to  the  question  posed  m  the  legend  of  Fig  2  is  ther  NO     The  right- 
most symbol  is  b  lo 
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Fig.  3c.    The  patte'-n  !?  now  laid  starting  at  64  :see  legend  to  Fig  ''r.'^    Tht   caIcuI 
tion  of  the  L  's  yields 


^c.-2  =  -' 

-3  c  ~    ~°° 
^3.1    -    ~°° 


--2.C   - 


^-3,1 


L. 


-  ~'-  ■  •^•0.0  -  0  ,  Lc  1  -  3  ,  ic,2  -  4  ,  L 


-C.3 


=   ■  7 


.  I,  c  =  -1  .  Z.1.1  =  0  ,  1,2  =  6  ,  L,  3  =  7 

^2.1    =    -'-       -2.2   =    0   '    -^2.3   =    6 
L3  2    =    — -    ,    Lg  3    =    0 

=  ,  Z.-,,c  =0     1.  11  =  1  ,  I_i,2  =  5 

I  _2  1    =    1        ^-2.2   -    5   •    --2.3   -   6 

/.-■jp  =  2,  i-iT^  6 


=    D 


How    we    compute    Zc.3-    ''"^''^  =  max;i_i2 
p?.re  nov.-  fc;.         to  rs  '"rr.m    9 


■^ c. 2  ■•"'-.  -^  1,2  +  -)     ^^"^   =  ~     '''' -    com- 
t-r-.i  e  ;io..   ^  11  ^>^  '  e  •  !"^^  -''1  12  ^'^'^  ^^^  *-^'^'   ^;:^:2  =  ^i;';:    /   =-     "    -   '-■- 

g  =  MA.XLESGTH[:Z  7)  =  2     f   =  g  =  2.    Therefore    rou.  =  rori.  +  /   =  S    and    we 
compare  again    Since,  here  w-e  have  no  information  from  5i  12    we  check  syinbc^ 
by  symbol    We  find  that  b^ob^^b^^  is  equal  7-ic7'iir,2  and  rozi.  =  12     The  answer  to 
the  question  posed  in  the  caption  to  Fig    2  is  YES  ic3  =  -2    The  right m.ost  sym:- 
bcl  is  fcis  and  the  new  5,,^  (53.15)  is:  ',3,0.0), (4, 1 .2)  .(6.0,0)  (7,3,3)  ;  10,7,5) 
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¥\g.  4.    Computation  of  MAXLENGTH      Here,  ^e  describe  the  pattern  analysis 

,  T^    The  -output  of  th-e  pa-tt-em  anaiy-sis  is  ttie 
MAXLENGTHiO m -1,0.    .,m -1  ]   '        where 

and 


The  pattern  is  an  array  f?  =-ri. 
two  dimensional  array 

MAXLENGTH  [x,j)  =  f       means       that      r^  +  i,  ■ '"i*-/  =       '^J  +  i'  ■  "^j +/ 

^1+/  +1  '^  ^;>/  +  i 

The       array       MAXLENGTH       is       symmetric.       That       is.       M.AXLENGTH[z.j)  = 
MAXLENGTH  [j  .i) .  We  will  apply  a  slight  modificatron  of  the  string  matching  algo- 
rithm of  (20)  due  to  (21).    Given  a  pattern  of  length  rn  and  a  text  of  length  n  this 
modification  finds  for  each  entry  of  text  one  of  two  things: 

1,  Whether  an  occurrence  of  the  pattern  starts  at  this  entry 

2,  The  first  mismatch  (from  the  left)    That  is,  it  finds  the  first  character  of  the 
text  which  differs  from  a  character  of  the  pattern, 

*\>  separately  compute  each  row  i  of  MAXLENGTH .  Take  r^  +  ,,  ,  r^  to  be  the 


pattern  and  r, 


r^    to  be  the  text.  We  compute   MAXLENGTH 


m  — .  j 
simply  by  applying  this  modification  The  computation  of  each  row  takes  0[m) 
time    Since,  There  are  m  rows  the  total  time  is  0{Tn^) 
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